Learning from minimum entropy queries in a large committee machine 
In supervised learning, the redundancy contained in random examples can be avoided by learning 
from queries. Using statistical mechanics, we study learning from minimum entropy queries in a 
large tree-committee machine. The generalization error decreases exponentially with the number 
of training examples, providing a significant improvement over the algebraic decay for random 
examples. The connection between entropy and generalization error in multi-layer networks is 
discussed, and a computationally cheap algorithm for constructing queries is suggested and analysed. 
In supervised learning of input-output mappings, the 
traditional approach has been to study generalization 
from random examples. However, random examples con- 
tain redundant information, and generalization perfor- 
mance can thus be improved by query learning, where 
each new training input is selected on the basis of the 
existing training data to be most 'useful' in some speci- 
fied sense. In this paper, we consider minimum entropy 
queries, defined by maximizing the most common mea- 
sure of 'usefulness', namely, the expected entropy de- 
crease (or information gain). In order to achieve opti- 
mal generalization performance, the theoretically opti- 
mal choice of queries would of course be based on a direct 
minimization of the generalization error, and not on max- 
imization of the entropy decrease. However, the general- 
ization error is not in general accessible as an objective 
function for query selection, while the expected entropy 
decrease of a query can often be determined fairly easily.
Since decreases in entropy and in generalization error are
normally correlated (see, e.g., Refs. [1,2]), minimizing
entropy provides a practical method for achieving
near-optimal generalization performance by query learning.

The generalization performance achieved by minimum
entropy queries is by now well understood for single-layer
neural networks such as linear and binary perceptrons.
For multi-layer networks, which are much more widely used
in practical applications, several heuristic algorithms for
query learning have been proposed. While such heuristic
approaches can demonstrate the power of query learning,
they are hard to generalize to situations other than the
ones for which they have been designed, and they cannot
easily be compared with more traditional techniques for
query selection such as optimal experimental design.
Furthermore, the existing analyses of such algorithms have
been carried out within the framework of PAC (probably
approximately correct) learning, yielding worst case
bounds which do not necessarily represent average case
behaviour. In this paper we therefore analyse the average
generalization performance achieved by query learning in
a multi-layer network, using the tools of statistical
mechanics.

We focus on one of the simplest multi-layer neural net- 
works, namely, the tree-committee machine (TCM). A 
TCM is a two-layer network with N input units, K hid- 
den units and one output unit. The 'receptive fields' of 
the individual hidden units do not overlap, and all the 
weights from the hidden to the output layer are fixed to 
one. The output $y$ for a given input vector $\mathbf{x}$ is therefore

$$y = \mathrm{sgn}\left(\sum_{i=1}^{K} \sigma_i\right), \qquad \sigma_i = \mathrm{sgn}\left(\mathbf{x}_i^{\mathrm{T}} \mathbf{w}_i\right), \qquad (1)$$

where the $\sigma_i$ are the outputs of the hidden units, $\mathbf{w}_i$ their
weight vectors and $\mathbf{x}^{\mathrm{T}} = (\mathbf{x}_1^{\mathrm{T}}, \ldots, \mathbf{x}_K^{\mathrm{T}})$, with $\mathbf{x}_i$ containing
the $N/K$ real-valued inputs which hidden unit $i$ receives.
The $N$ components of the $K$ $(N/K)$-dimensional hidden
unit weight vectors $\mathbf{w}_i$, which we denote collectively by
$\mathbf{w}$, form the adjustable parameters of a TCM. Without
loss of generality, the weight vectors are assumed to be
normalized to $\mathbf{w}_i^2 = N/K$, corresponding roughly to
individual weights of $O(1)$.
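
The architecture just described can be sketched in a few lines of Python. This is only an illustration: the function names `tcm_output` and `random_spherical`, the array layout (one row per hidden unit) and the small values of K and N/K are assumptions made here, not part of the analysis.

```python
import numpy as np

def tcm_output(w, x):
    """Output of a tree-committee machine.

    w, x: arrays of shape (K, N // K). Row i holds the weights / inputs
    of hidden unit i (non-overlapping receptive fields).
    """
    sigma = np.sign(np.einsum("ij,ij->i", w, x))  # hidden unit outputs sigma_i
    return int(np.sign(sigma.sum()))              # hidden-to-output weights fixed to one

def random_spherical(K, n, rng):
    """K random vectors of dimension n, each normalized to squared length n."""
    v = rng.standard_normal((K, n))
    return v * np.sqrt(n) / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(0)
K, n = 5, 20                      # n plays the role of N/K; odd K avoids ties
w = random_spherical(K, n, rng)   # student (or teacher) weights
x = random_spherical(K, n, rng)   # one input obeying the spherical constraints
y = tcm_output(w, x)
```

Because every hidden unit output changes sign when the input does, `tcm_output(w, -x)` returns `-y` for odd K.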

As our training algorithm we take (zero temperature) 
Gibbs learning, which generates at random any TCM (in 
the following referred to as a 'student') which predicts 
all the training outputs in a given set of $p$ training examples
$\Theta^{(p)} = \{(\mathbf{x}^\mu, y^\mu),\ \mu = 1 \ldots p\}$ correctly. We take
the problem to be perfectly learnable, which means that
the outputs $y^\mu$ corresponding to the inputs $\mathbf{x}^\mu$ are
generated by a 'teacher' TCM with the same architecture as
the student but with different, unknown weights $\mathbf{w}^0$. It
is further assumed that there is no noise on the training
examples. For learning from random examples, the
training inputs $\mathbf{x}^\mu$ are sampled randomly from a
distribution $P_0(\mathbf{x})$. Since the output (1) of a TCM is
independent of the length of the hidden unit input vectors
$\mathbf{x}_i$, we assume this distribution $P_0(\mathbf{x})$ to be uniform over
all vectors $\mathbf{x}^{\mathrm{T}} = (\mathbf{x}_1^{\mathrm{T}}, \ldots, \mathbf{x}_K^{\mathrm{T}})$ which obey the spherical
constraints $\mathbf{x}_i^2 = N/K$.

For query learning, the training inputs $\mathbf{x}^\mu$ are chosen
to maximize the expected decrease of the entropy $S$ in
the parameter space of the student. The entropy for a
given training set $\Theta^{(p)}$ is defined as

$$S(\Theta^{(p)}) = -\int d\mathbf{w}\, P(\mathbf{w}|\Theta^{(p)}) \ln P(\mathbf{w}|\Theta^{(p)}). \qquad (2)$$

For the Gibbs learning algorithm considered here,
$P(\mathbf{w}|\Theta^{(p)})$ is uniform on the 'version space', the space
of all students satisfying the spherical constraints
$\mathbf{w}_i^2 = N/K$ which predict all training outputs correctly, and
zero otherwise. Denoting the version space volume by
$V(\Theta^{(p)})$, the entropy can thus simply be written as
$S(\Theta^{(p)}) = \ln V(\Theta^{(p)})$. The entropy decrease
$\Delta S = S(\Theta^{(p)}) - S(\Theta^{(p+1)})$ resulting from the addition of a new
example $(\mathbf{x}^{p+1}, y^{p+1})$ to the existing training set cannot
be maximized directly, since it depends on the new
training output $y^{p+1}$ generated by the unknown teacher.
Queries are thus chosen to maximize the expected entropy
decrease, obtained by averaging over $y^{p+1}$. Assuming a
uniform prior over teachers, the probability of a certain
teacher having produced the training set $\Theta^{(p)}$ is uniform
over the version space and zero otherwise. The probability
of obtaining output $y^{p+1} = \pm 1$ given input $\mathbf{x}^{p+1}$ is
therefore simply $v_\pm = V(\Theta^{(p+1)})|_{y^{p+1}=\pm 1} / V(\Theta^{(p)})$, the
fraction of the version space left over after the new
example $(\mathbf{x}^{p+1}, y^{p+1} = \pm 1)$ has been added. This gives
the expected entropy decrease

$$\langle \Delta S \rangle = -v_+ \ln v_+ - v_- \ln v_-,$$

which attains its maximum value $\ln 2$ (= 1 bit) when
$v_\pm = 1/2$, i.e., when the new input $\mathbf{x}^{p+1}$ bisects the existing
version space. This is intuitively reasonable, since $v_\pm = 1/2$
corresponds to maximum uncertainty about the new
output and hence to maximum information gain once this
output is known.
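
The dependence of the expected entropy decrease on the version space fraction can be checked numerically. The sketch below is illustrative only; the function name `expected_entropy_decrease` is an assumption, and the result is in nats (natural logarithm).

```python
import math

def expected_entropy_decrease(v_plus):
    """Expected entropy decrease (in nats) for a query that leaves a
    fraction v_plus of the version space if y = +1 and 1 - v_plus if y = -1."""
    total = 0.0
    for v in (v_plus, 1.0 - v_plus):
        if v > 0.0:               # the term v * ln(v) is taken as 0 for v = 0
            total -= v * math.log(v)
    return total

bisecting = expected_entropy_decrease(0.5)   # query that bisects the version space
lopsided = expected_entropy_decrease(0.9)    # query whose answer is nearly certain
```

A bisecting query gains ln 2 nats, while a query whose answer is predictable with probability 0.9 gains less than half of that.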

Due to the complex geometry of the version space, the
generation of queries which achieve exact bisection is in
general computationally infeasible. The 'query by committee'
algorithm provides a solution to this problem
by first sampling a 'committee' of $2k$ students from the
Gibbs distribution $P(\mathbf{w}|\Theta^{(p)})$ and then using the fraction
of committee members which predict $+1$ or $-1$ for the
output $y$ corresponding to an input $\mathbf{x}$ as an approximation
to the true probability $P(y = \pm 1|\mathbf{x}, \Theta^{(p)}) = v_\pm$. The
condition $v_\pm = 1/2$ is then approximated by the requirement
that exactly $k$ of the committee members predict
output $+1$ and the other $k$ predict $-1$ for the new training
input $\mathbf{x}^{p+1}$. An approximate minimum entropy query
can thus be found by sampling (or filtering) inputs from
a stream of random inputs until this condition is met.
The procedure is then repeated for each new query. As
$k \to \infty$, this algorithm approaches exact bisection, and
we focus on this limit in the following.
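
A toy version of the filtering loop is sketched below. It is a sketch under stated assumptions, not the procedure analysed in this paper: the committee is drawn by naive rejection sampling from the version space (feasible only for a handful of examples and tiny networks), and all function names and parameter values are hypothetical.

```python
import numpy as np

def random_spherical(K, n, rng):
    """K random vectors of dimension n, each with squared length n."""
    v = rng.standard_normal((K, n))
    return v * np.sqrt(n) / np.linalg.norm(v, axis=1, keepdims=True)

def tcm_output(w, x):
    """TCM output: sign of the sum of the K hidden unit outputs."""
    return int(np.sign(np.sign(np.einsum("ij,ij->i", w, x)).sum()))

def sample_committee(inputs, outputs, K, n, size, rng):
    """Draw `size` students uniformly from the version space by rejection
    sampling (assumption made for brevity; cost grows quickly with p)."""
    committee = []
    while len(committee) < size:
        w = random_spherical(K, n, rng)
        if all(tcm_output(w, x) == y for x, y in zip(inputs, outputs)):
            committee.append(w)
    return committee

def filter_query(committee, K, n, rng, max_tries=100_000):
    """Filter random inputs until exactly half the committee predicts +1."""
    k = len(committee) // 2
    for _ in range(max_tries):
        x = random_spherical(K, n, rng)
        if sum(tcm_output(w, x) == 1 for w in committee) == k:
            return x
    return None  # no input with an even split found within max_tries

rng = np.random.default_rng(1)
K, n, k = 3, 10, 2                      # tiny network, committee of 2k = 4
teacher = random_spherical(K, n, rng)
inputs, outputs = [], []
for _ in range(4):                      # generate four queries in sequence
    committee = sample_committee(inputs, outputs, K, n, 2 * k, rng)
    x = filter_query(committee, K, n, rng)
    inputs.append(x)
    outputs.append(tcm_output(teacher, x))
```

Each accepted input splits the current committee evenly, which is the finite-k stand-in for bisecting the version space.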

The main quantity of interest in our analysis is the
generalization error $\epsilon_g$, defined as the probability that a
given student TCM will predict the output of the teacher
incorrectly for a random test input sampled from $P_0(\mathbf{x})$.
We consider the thermodynamic limit $N \to \infty$ at constant
number of training examples per weight, $\alpha = p/N$,
and focus on the case of a large number of hidden units,
$K \to \infty$ with $N/K \gg 1$. The generalization error then
takes the form

$$\epsilon_g = \frac{1}{\pi} \arccos R_{\mathrm{eff}} \qquad (3)$$

where $R_{\mathrm{eff}}$ is an effective overlap parameter given by

$$R_{\mathrm{eff}} = \frac{1}{K} \sum_{i=1}^{K} f(R_i), \qquad f(\cdot) = \frac{2}{\pi} \arcsin(\cdot)$$

in terms of the overlaps of the student and teacher hidden
unit weight vectors, $R_i = (K/N)\, \mathbf{w}_i^{\mathrm{T}} \mathbf{w}_i^0$. In the
thermodynamic limit, the $R_i$ are self-averaging, i.e., their values
for a specific teacher, training set and student from the
Gibbs distribution are identical to their averages with
probability one. These averages can be obtained from a
replica calculation of the average entropy $S$ as a function
of $\alpha$, following previously published calculations. We use the
assumption of replica symmetry, which is believed to be
exact for the case of noise free training data. The
replica calculation involves, in addition to the $R_i$, the
overlap parameters ($\mu < p$)

$$q_i^{p} = (K/N)\, (\bar{\mathbf{w}}_i^{p})^2, \qquad q_i^{\mu,p} = (K/N)\, (\bar{\mathbf{w}}_i^{\mu})^{\mathrm{T}} \bar{\mathbf{w}}_i^{p},$$

where $\bar{\mathbf{w}}_i^{p} = \langle \mathbf{w}_i \rangle_{P(\mathbf{w}|\Theta^{(p)})}$ and similarly for $\bar{\mathbf{w}}_i^{\mu}$. The $q_i^{\mu,p}$
arise from the average over the $(\mu+1)$-th of the $p$ training
examples as the overlaps of the committee members
which determine the selection of this example with the
students trained on all $p$ examples. The $q_i^{p}$ can be
determined from saddle point equations, whereas the $q_i^{\mu,p}$
have to be determined independently. However, given
the assumption of self-averaging of all overlap parameters,
it can be shown that $q_i^{\mu,p} = q_i^{\mu}$ in the case considered
here. This relation, which is proved by induction
from the case $p = \mu + 1$, can be explained intuitively as
follows. Given the first $\mu$ training examples $\Theta^{(\mu)}$, the
teacher can be anywhere in the corresponding version
space $\mathcal{V}^{\mu}$. Considering an average over all possible sets
of training examples $\mu + 1 \ldots p$ produced by teachers in
$\mathcal{V}^{\mu}$, the student is therefore equally likely to end up in
any part of $\mathcal{V}^{\mu}$ after having been trained on the whole
training set $\Theta^{(p)}$.
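
As a numerical illustration of Eq. (3), the sketch below evaluates the large-K generalization error from a set of hidden unit overlaps R_i. The function names are illustrative assumptions, and the clipping step only guards against floating point rounding.

```python
import numpy as np

def effective_overlap(R):
    """R_eff = (1/K) * sum_i f(R_i) with f(R) = (2/pi) * arcsin(R)."""
    R = np.asarray(R, dtype=float)
    r_eff = np.mean(2.0 / np.pi * np.arcsin(R))
    return float(np.clip(r_eff, -1.0, 1.0))   # guard against rounding past +-1

def generalization_error(R):
    """Large-K generalization error, Eq. (3)."""
    return float(np.arccos(effective_overlap(R)) / np.pi)

eps_untrained = generalization_error([0.0, 0.0, 0.0])   # no overlap with teacher
eps_partial = generalization_error([0.9, 0.9, 0.9])     # well-aligned hidden units
```

With vanishing overlaps the student guesses at random (error 1/2), and the error falls towards zero as all R_i approach one.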

We assume symmetry between the hidden units,
i.e., $q_i^{p} = q$, $q_i^{\mu,p} = q_i^{\mu} = q(\alpha')$ ($\alpha' = \mu/N$) and $R_i = R$.
The calculation, details of which will be reported
elsewhere, can be further simplified by exploiting the
relation $R = q$, which expresses the symmetry between
teacher and student. One then obtains
the normalized entropy $s = S/N$ (apart from an additive
constant, which we fix such that $s = 0$ at $\alpha = 0$) as the
saddle point of

$$\frac{1}{2}\left(q + \ln(1-q)\right) + 2 \int_0^{\alpha} d\alpha' \int Dz\, H(\gamma z) \ln H(\gamma z) \qquad (4)$$

with respect to $q$, where $\gamma = [(q_{\mathrm{eff}} - q_{\mathrm{eff}}(\alpha'))/(1 - q_{\mathrm{eff}})]^{1/2}$
and $q_{\mathrm{eff}} = f(q)$, $q_{\mathrm{eff}}(\alpha') = f(q(\alpha'))$. We have also used
the shorthand $Dz = dz \exp(-\frac{1}{2}z^2)/\sqrt{2\pi}$ and $H(z) = \int_z^{\infty} Dx$.
Differentiating (4) with respect to $\alpha$, one verifies
that $ds/d\alpha = -\ln 2$ as expected for minimum entropy
queries (the large committee limit $k \to \infty$ has already
been taken).
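
The step from the saddle point expression to ds/dα = −ln 2 can be spelled out as follows. This is a sketch that assumes the entropy has the form of a q-dependent term plus an integral over α' of H(γz) ln H(γz), with γ vanishing when q(α') equals q; "extr" denotes the saddle point with respect to q.

```latex
\begin{align}
s(\alpha) &= \underset{q}{\mathrm{extr}} \left[ \tfrac{1}{2}\bigl(q + \ln(1-q)\bigr)
  + 2 \int_0^{\alpha} d\alpha' \int Dz\, H(\gamma z) \ln H(\gamma z) \right], \\
\frac{ds}{d\alpha} &= 2 \int Dz\, H(\gamma z) \ln H(\gamma z) \Big|_{\alpha' = \alpha}
  && \text{(the $q$-derivative vanishes at the saddle point)} \\
 &= 2 \int Dz\, H(0) \ln H(0)
  && \text{(at $\alpha' = \alpha$, $q(\alpha') = q$ and hence $\gamma = 0$)} \\
 &= 2 \cdot \tfrac{1}{2} \ln \tfrac{1}{2} = -\ln 2
  && \text{(using $H(0) = \tfrac{1}{2}$ and $\textstyle\int Dz = 1$)}.
\end{align}
```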
FIG. 1. (a) Generalization error $\epsilon_g$ as a function of the
normalized number of examples, $\alpha$, for exact minimum
entropy queries, queries as selected by the constructive algorithm,
and random examples. (b) Log generalization error $\ln \epsilon_g$ vs.
entropy $s$, for the same three cases. For both queries and
random examples, $\ln \epsilon_g \approx \frac{1}{2}s$ for large negative values of $s$
(corresponding to large $\alpha$). The very small separation between the
curves is more clearly seen in the inset, which shows $\ln \epsilon_g - \frac{1}{2}s$
vs. $s$.

Solving the saddle point equation numerically, we obtain
the average generalization error as plotted in
Figure 1(a). For large $\alpha$, we find that $\epsilon_g \propto \exp(-c\alpha)$
with $c = \frac{1}{2}\ln 2$, which can also be confirmed analytically
from (4). This exponential decay of the generalization
error $\epsilon_g$ with $\alpha$ provides a marked improvement over
the $\epsilon_g \propto 1/\alpha$ decay achieved by random examples.
The effect of minimum entropy queries is thus similar to
what is observed for a binary perceptron learning from
a binary perceptron teacher, but the decay constant $c$
is only half of that for the binary perceptron. This
means that asymptotically, twice as many examples are
needed for a TCM as for a binary perceptron (when learning
from a teacher with the respective architecture) to
achieve the same generalization performance, in agreement
with the corresponding result for random examples.
Since in both networks, due to the binary nature
of their outputs, minimum entropy queries lead to an
entropy $s = -\alpha \ln 2$, we can also conclude that the large
$\alpha$ relation $s \approx \ln \epsilon_g$ for the binary perceptron has to
be replaced by $s \approx \ln \epsilon_g^2$ for the tree-committee machine.
This relation should hold independently of whether one
is learning from queries or from random examples. We
have confirmed this by calculating the entropy for learning
from random examples and comparing with the
corresponding generalization error, as shown in Figure 1(b).
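
The link between the entropy relation and the decay constant is a one-line piece of arithmetic, written out here for both architectures (prefactors and additive constants are ignored, as appropriate for large α):

```latex
\begin{align}
\text{binary perceptron:}\quad & s = -\alpha \ln 2,\ \ s \approx \ln \epsilon_g
  \;\Rightarrow\; \epsilon_g \propto \exp(-\alpha \ln 2), && c = \ln 2 \approx 0.693, \\
\text{TCM:}\quad & s = -\alpha \ln 2,\ \ s \approx \ln \epsilon_g^2 = 2 \ln \epsilon_g
  \;\Rightarrow\; \epsilon_g \propto \exp\bigl(-\tfrac{1}{2}\alpha \ln 2\bigr), && c = \tfrac{1}{2}\ln 2 \approx 0.347 .
\end{align}
```

Reaching a given small generalization error therefore takes twice the value of α in the TCM.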

The above results are derived in the limit of a large
number of hidden units, $K \to \infty$. For large but
finite $K$ they can be shown to be valid as long as
the $O(1/K)$ correction to the generalization error (3),
$(-1/2\pi K)\, R_{\mathrm{eff}} (1 - R_{\mathrm{eff}}^2)^{-1/2}$, remains negligible, which
holds for $\epsilon_g \gg O(K^{-1/2})$. In the opposite regime
$\epsilon_g \ll O(K^{-1/2})$, i.e., for higher $\alpha$, the generalization
error $\epsilon_g \approx (K/8)^{1/2} (1 - R_{\mathrm{eff}}) \propto \arccos(R)$ has the same
functional dependence on $R$ as for the binary perceptron,
due to the fact that its dominant contribution arises
from errors for which student and teacher only differ in
the output of a single hidden unit. There is therefore
a cross-over in the large $\alpha$ dependence of $\epsilon_g$ from TCM
($K \to \infty$) to binary perceptron type behaviour around
$\epsilon_g = O(K^{-1/2})$.

We now consider the practical realization of minimum
entropy queries in the TCM. The query by committee
approach, which in the limit $k \to \infty$ is an exact
algorithm for selecting minimum entropy queries, filters
queries from a stream of random inputs. This leads
to an exponential increase of the query filtering time
with the number of training examples that have already
been learned. As a computationally cheap alternative
we propose a simple algorithm for constructing
queries, which is based on the assumption of an approximate
decoupling of the entropies of the different hidden
units, as follows. Each individual hidden unit of a TCM
can be viewed as a binary perceptron. The distribution
$P(\mathbf{w}_i|\Theta^{(p)})$ of its weight vector $\mathbf{w}_i$ given a set of training
examples $\Theta^{(p)}$ has an entropy $S_i$ associated with it,
in analogy to the entropy (2) of the full weight distribution
$P(\mathbf{w}|\Theta^{(p)})$. Our 'constructive algorithm' for selecting
queries then consists in choosing, for each new query
$\mathbf{x}^{p+1}$, the inputs $\mathbf{x}_i^{p+1}$ to the individual hidden units in
such a way as to maximize the decrease in their entropies
$S_i$. This can be achieved simply by choosing each $\mathbf{x}_i^{p+1}$
to be orthogonal to the Gibbs average $\bar{\mathbf{w}}_i^{p} = \langle \mathbf{w}_i \rangle_{P(\mathbf{w}|\Theta^{(p)})}$
(and otherwise random, i.e., according to $P_0(\mathbf{x})$), thus
avoiding the time-consuming filtering from a random input
stream. In practice, one would of course approximate
$\bar{\mathbf{w}}_i^{p}$ by an average of $2k$ (say) samples from the Gibbs
distribution $P(\mathbf{w}|\Theta^{(p)})$; these samples would have been
needed anyway in the query by committee approach.
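
The construction step can be sketched as below. This is an illustration under assumptions: the 2k Gibbs samples are replaced by random stand-in data, the function name `constructive_query` is hypothetical, and the random direction is obtained by projecting a Gaussian vector, which is one simple way of drawing uniformly within the orthogonal subspace.

```python
import numpy as np

def constructive_query(mean_w, rng):
    """Construct one query for a TCM.

    mean_w: array of shape (K, n); row i approximates the Gibbs average
    of the weight vector of hidden unit i.
    Returns x of shape (K, n): each x_i is random, orthogonal to mean_w[i]
    and normalized to squared length n.
    """
    K, n = mean_w.shape
    x = rng.standard_normal((K, n))
    norm2 = np.einsum("ij,ij->i", mean_w, mean_w)
    # projection coefficients; left at zero where the average vanishes
    coeff = np.divide(np.einsum("ij,ij->i", x, mean_w), norm2,
                      out=np.zeros(K), where=norm2 > 0)
    x -= coeff[:, None] * mean_w          # remove component along mean_w[i]
    x *= np.sqrt(n) / np.linalg.norm(x, axis=1, keepdims=True)
    return x

rng = np.random.default_rng(0)
K, n, k = 3, 50, 5
samples = rng.standard_normal((2 * k, K, n))  # stand-in for 2k Gibbs samples
mean_w = samples.mean(axis=0)                 # estimate of the average weights
x = constructive_query(mean_w, rng)
```

In contrast to filtering, the cost of one query here is a single projection per hidden unit, independent of how many examples have already been learned.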

An analysis of the generalization performance achieved
by this constructive algorithm proceeds along the same
line as the calculation for exact minimum entropy
queries. Again restricting attention to the limit $k \to \infty$,
we find that the saddle point expression (4) for the
normalized entropy $s$ still holds, but with $\gamma$ now given by
$\gamma = [a/(1-a)]^{1/2}$ ($a = f([q - q(\alpha')]/[1 - q(\alpha')])$).
Differentiating (4) with this replacement with respect to $\alpha$,
we find again that $ds/d\alpha = -\ln 2$, which means that in
the thermodynamic limit that we consider, queries selected
to minimize the individual hidden units' entropies
also minimize the overall entropy of the TCM. This may
seem surprising at first; heuristically, however, one can
argue that for a large number of hidden units $K$, the
correlations in the Gibbs distribution between the hidden
unit weight vectors must be weak, and may indeed become
negligible in the $K \to \infty$ limit considered here. The
generalization performance achieved by the constructive
query algorithm, shown in Figure 1(a), is actually slightly
superior to that of exact minimum entropy queries as
calculated above. This decrease in generalization
error, although slight (about 4% for large $\alpha$),
exemplifies the fact that while decreases in entropy and
in generalization error are normally correlated, there is
no exact one-to-one relationship between them. Query
selection algorithms which achieve the same entropy
decrease can therefore lead to different generalization
performance.

We have found above a modification of the relationship
between entropy $s$ and generalization error $\epsilon_g$ from
$s \approx \ln \epsilon_g$ for the binary perceptron to $s \approx \ln \epsilon_g^2$ for the
TCM, and a corresponding change of the decay constant
$c$ in the asymptotic behaviour of the generalization
error $\epsilon_g \propto \exp(-c\alpha)$. This leads to the interesting
question of the value of $c$ in more general multi-layer neural
networks, and in particular its dependence on the number
of hidden units $K$. The existing bound derived
for the $k = 1$ query by committee algorithm implies a
lower bound on $c$ which scales inversely with the
VC-dimension of the class of networks considered. Taking
the storage capacity of a network as a coarse measure of
its VC-dimension, one would then conclude from existing
bounds that $c$ could be as small as $O(1/\ln K)$
for large $K$. However, the existing results for the
capacity of particular networks like the TCM are not
unambiguous enough to decide whether realistic networks
would saturate this bound. Furthermore, it has been
argued previously that both the input space dimension
and the VC-dimension determine the $\alpha$-dependence
of the generalization error. Replacing the VC-dimension
in this bound with the input space dimension,
one would then obtain a $c$ of $O(1)$ independently of $K$.
More theoretical work is clearly needed to clarify these
questions.

With regard to the practical application of query learning
in realistic multi-layer neural networks, the results we
have obtained for a constructive query algorithm based
on the assumption of a decoupling of the entropies of
individual hidden units are encouraging. For example,
the proposed constructive algorithm can be modified for
query learning in a fully-connected committee machine
(where each hidden unit is connected to all the inputs),
by simply choosing each new query to be orthogonal to
the subspace spanned by the average weight vectors of all
$K$ hidden units. As long as $K$ is much smaller than the
input dimension $N$, and assuming that for large enough
$K$ the approximate decoupling of the hidden unit
entropies still holds for fully connected networks, one would
expect this algorithm to yield a good approximation to
minimum entropy queries. It is an open question
whether this conclusion would also hold for a general
two-layer network with threshold units (where the
hidden-to-output weights are also free parameters), which can
approximate a large class of input-output mappings. We
are currently investigating these issues in order to assess
whether the significant improvements in generalization
performance achieved by minimum entropy queries can
be made available, in a computationally cheap manner,
for learning in realistic binary output multi-layer neural
networks.
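
The modification for a fully-connected committee machine mentioned above amounts to projecting a random input onto the orthogonal complement of the K average weight vectors. The sketch below is illustrative: the function name `orthogonal_query`, the use of a QR decomposition, the stand-in data and the normalization of the query to squared length N are all assumptions made here.

```python
import numpy as np

def orthogonal_query(mean_W, rng):
    """Random query orthogonal to the span of K average weight vectors.

    mean_W: array of shape (K, N) with K < N; row i approximates the
    average weight vector of hidden unit i.
    Returns x of shape (N,), normalized (by assumption) to x.x = N.
    """
    K, N = mean_W.shape
    Q, _ = np.linalg.qr(mean_W.T)       # (N, K) orthonormal basis of the span
    x = rng.standard_normal(N)
    x -= Q @ (Q.T @ x)                  # remove the component inside the span
    return x * np.sqrt(N) / np.linalg.norm(x)

rng = np.random.default_rng(0)
K, N = 4, 100                           # K much smaller than N
mean_W = rng.standard_normal((K, N))    # stand-in for the average weight vectors
x = orthogonal_query(mean_W, rng)
```

The projection removes K of the N input directions, so for K much smaller than N the query remains close to a random input in all other respects.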